Performance Optimization

Micro-optimizations on modern CPUs

Easyperf

High-level analysis

Flame graph

CPU cache and memory access

Basic cache architecture

  • all memory access through cache
  • L1/L2 cache per core
  • L3 cache shared between cores
  • NUMA adds another level

Set-associative cache

  • hierarchical access
  • cache line 64 bytes
  • set idx = address mod #sets
  • cache lines can conflict

AMD Ryzen 7 5850U

Common pitfalls

  • random access prevents pre-fetching
  • strided access pollutes cache lines
  • hyper threading potentially pollutes cache lines
  • different thread writes to my cache line (“false sharing”)
  • memory ordering in C++ (sequential consistency..)

CPU micro architecture

Pipeline model

Increased throughput by pipeline.

  1. Instruction fetch (IF)
  2. Instruction decode (ID)
  3. Execute (EXE)
  4. Memory access (MEM)
  5. Write back (WRI)

Execution is superscalar and out-of-order.

Latency of common operations

Latency: Time before next dependent instruction is issued.

Common pitfalls

  • Long dependency chains prevent out-of-order exection
  • Port contention
  • Branch misprediction
  • Aliasing prevents loop optimizations
  • Missed vectorization opportunities

Roofline analysis

Basic idea

Roofline plot: Plot arithmetic intensity vs. FLOP/s

\[\begin{equation} \mathrm{AI} = \frac{\mathrm{FLOP}}{\mathrm{byte}} \end{equation}\]

Premise: AI is algorithm property and hardware-independent (not true!)

Roofline analysis - continued

Rooflines depend on implementation of algorithm!

Example

void multiply(float *A, float *B, float *C) {
   for (size_t i = 0; i < N; ++i) {
      for (size_t j =0; j < N; ++j) {
         for (size_t k = 0; k < N; ++k) {
            // very naughty memory access pattern!
            C[i*N+j] += A[i*N+k] * B[k*N+j];
         }
      }
   }
}

Neglecting integer operations, we have

\[\begin{equation} \mathrm{AI} = \frac{2}{3} \end{equation}\]

My machine

setfos

Top-Down Microarchitecture Analysis

Performance metrics

  • Retired instructions: instructions that were executed and not thrown away (speculative instructions)
  • Cycles per instruction (CPI)
  • Instructions per cycle (IPC)
  • Pipeline slot occupation
  • Cache missses

Top-Down Microarchitecture Analysis

Setfos VTUNE TMAM